Error estimation and model selection

Author

  • Tobias Scheffer
Abstract

Machine learning algorithms search a space of possible hypotheses and estimate the error of each hypothesis using a sample. Most often, the goal of a classification task is to find a hypothesis with a low true (or generalization) misclassification probability (or error rate); however, only the sample (or empirical) error rate can actually be measured and minimized. The true error rate of the returned hypothesis is unknown but can, for instance, be estimated using cross validation, and very general worst-case bounds can be given. This doctoral dissertation addresses a set of questions on error assessment and the closely related selection of a “good” hypothesis language, or learning algorithm, for a given problem. In the first part of this thesis, I present a new analysis of the generalization error of the hypothesis which minimizes the empirical error within a finite hypothesis language. The solution characterizes the generalization error of the apparently best hypothesis in terms of the distribution of error rates of hypotheses in the hypothesis language. The distribution of error rates can, for any given problem, be estimated efficiently from the sample. Effectively, this analysis predicts how good the outcome of a learning algorithm would be without the learning algorithm actually having to be invoked. This immediately leads to an efficient algorithm for the selection of a good hypothesis language (or “model”). The analysis predicts (and thus explains) the shape of learning curves with very high accuracy and thus contributes to a better understanding of the nature of overfitting. I study the behavior of the model selection algorithm empirically (in particular, in comparison to cross validation) using both artificial problems and a large-scale text categorization problem. In the next step, I study in which situations performing automatic model selection is actually beneficial; in particular, I study Occam algorithms and cross validation.
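The idea that the distribution of error rates in a hypothesis language can be estimated from the sample, without running the learner, can be illustrated with a minimal sketch. It is not the analysis from the thesis: the toy problem, the threshold hypothesis language, and the names `sample_error_rates` and `histogram` are assumptions made only for illustration.

```python
import random

def sample_error_rates(xs, ys, n_hypotheses, rng):
    """Empirical error rates of randomly drawn threshold hypotheses h_t(x) = [x > t]."""
    rates = []
    for _ in range(n_hypotheses):
        t = rng.uniform(0.0, 1.0)  # draw one hypothesis from the (assumed) language
        errors = sum((x > t) != y for x, y in zip(xs, ys))
        rates.append(errors / len(xs))
    return rates

def histogram(rates, n_bins=10):
    """Relative frequency of error rates in equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for r in rates:
        counts[min(int(r * n_bins), n_bins - 1)] += 1
    return [c / len(rates) for c in counts]

rng = random.Random(0)
# Assumed toy problem: the label is [x > 0.5], flipped with probability 0.1.
xs = [rng.random() for _ in range(200)]
ys = [(x > 0.5) != (rng.random() < 0.1) for x in xs]

rates = sample_error_rates(xs, ys, n_hypotheses=500, rng=rng)
hist = histogram(rates)
print("lowest sampled error rate:", min(rates))
print("estimated error-rate distribution:", hist)
```

The histogram is only a crude estimate of how error rates are spread over the language; how such a distribution is turned into a prediction of the generalization error of the empirically best hypothesis is the subject of the thesis and is not reproduced here.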
Model selection techniques such as tree pruning, weight decay, or cross validation are employed by virtually all “practical” learners and are generally believed to enhance the performance of learning algorithms. However, I show that this belief is equivalent to an assumption on the distribution of problems to which the learning algorithm is exposed. I specify these distributional assumptions and quantify the benefit of Occam algorithms and cross validation in these situations. When the distributional assumptions fail, model selection based on cross validation increases the generalization error of the returned hypothesis on average. When several distinct learners are assessed with respect to a particular problem (or one learner is assessed repeatedly with distinct parameter settings), an effect arises which is very similar to the overfitting that occurs during error-minimization processes: the lowest observed error rate is an optimistic estimate of the corresponding generalization error. I quantify this bias. In particular, I study the bias which is imposed by repeated invocations of a learner with distinct parameter settings when n-fold cross validation is used to estimate the error rate. I pursue an information-theoretic approach which does not require the assumption that empirical error rates measured in distinct cross validation folds are independent estimates. I discuss the implications of these results for empirical studies which have been carried out in the past and propose an experimental setting which leads to almost unbiased results. Finally, I address complexity issues of model selection. In learning based on model selection, the learning algorithm is restricted to a (small) model, chosen by the model selection algorithm. By contrast, in the boosting setting, the hypothesis is allowed to grow dynamically, often until the hypothesis is fitted to the data.
By giving new worst-case time bounds for the AdaBoost algorithm, I show that in many cases the restriction to small sets of hypotheses causes the high complexity of learning.
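The optimistic bias of the lowest observed error rate, mentioned in the abstract, can be made concrete with a small simulation. This is a sketch under simplifying assumptions, not the information-theoretic analysis of the thesis: every learner is assumed to have the same true error rate, each is evaluated on its own independent test set, and the names `min_observed_error` and `selection_bias` are hypothetical.

```python
import random
import statistics

def min_observed_error(true_error, n_learners, n_test, rng):
    """Lowest empirical error among learners that all share the same true error."""
    observed = []
    for _ in range(n_learners):
        # Each test example is misclassified independently with prob. true_error.
        mistakes = sum(rng.random() < true_error for _ in range(n_test))
        observed.append(mistakes / n_test)
    return min(observed)

def selection_bias(true_error=0.3, n_learners=20, n_test=100, n_trials=500, seed=0):
    """True error minus the average lowest observed error over many trials."""
    rng = random.Random(seed)
    minima = [min_observed_error(true_error, n_learners, n_test, rng)
              for _ in range(n_trials)]
    return true_error - statistics.mean(minima)

bias = selection_bias()               # best of 20 equally good learners
single = selection_bias(n_learners=1) # a single learner, no selection
print("bias when picking the best of 20:", bias)
print("bias with a single learner:", single)
```

With a single learner the estimate is close to unbiased, while reporting the best of 20 systematically understates the true error. The independence assumption used here is exactly what the abstract says the thesis avoids, so the simulation only shows the direction of the effect.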

Similar resources

OPTIMAL SELECTION OF NUMBER OF RAINFALL GAUGING STATIONS BY KRIGING AND GENETIC ALGORITHM METHODS

In this study, optimum combinations of available rainfall gauging stations are selected by a model which consists of a geostatistical model as an estimator and an optimized model. First, the watershed is approximated by several regular geometric shapes. Then kriging calculates the variance ...

Development of a Pharmacogenomics Model based on Support Vector Regression with Optimal Features Selection Approach to Determine the Initial Therapeutic Dose of Warfarin Anticoagulant Drug

Introduction: Using artificial intelligence tools in pharmacogenomics is one of the latest bioinformatics research fields. One of the most important drugs that determining its initial therapeutic dose is difficult is the anticoagulant warfarin. Warfarin is an oral anticoagulant that, due to its narrow therapeutic window and complex interrelationships of individual factors, the selection of its ...

Optimization of the relationships between flow discharge and suspended sediment discharge at the stations of the Gharesoo basin

In this study, using sediment rating curve models [USBR, seasonal model, monthly model, a model based on separating dry and wet seasons, data separation based on flow measurement time (months of low-water and high-water seasons), and data separation based on months without and with green vegetation] at 6 hydrometric stations on the Gharesoo River in Golestan Province, with the aim of se...

The selection of the best from climate change model in the estimation of climatology variables for east region of the country by use fifth report data

Climate change is nowadays a major cause of concern in water-related fields because it may cause more severe, shortened, or prolonged droughts or floods in the future. This research attempts to determine, from among the available climate change models, the best model for estimating the minimum temperature, maximum temperature, and precipitation at the Birjand synoptic station. For this r...

OPTIMUM SELECTION OF NUMBER AND LOCATION OF GEOTECHNICAL BOREHOLES BASED ON SOIL RESISTANCE

Drilling geotechnical boreholes and performing soil resistance tests are time-consuming and expensive activities. Selecting the optimum number and suitable locations of boreholes can therefore reduce the cost of drilling and of the soil resistance tests. In this research, a model consisting of a geostatistical model as an estimator and an optimized model is selected. Kriging calculates the variance ...

Journal title:
  • KI

Volume 13  Issue 

Pages  -

Publication date 1999